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Abstract 

The statistical leverage scores of a matrix A are the squared row-norms of the matrix con- 
taining its (top) left singular vectors and the coherence is the largest leverage score. These 
quantities have been of interest in recently-popular problems such as matrix completion and 
Nystrom-based low-rank matrix approximation; in large-scale statistical data analysis appli- 
cations more generally; and since they define the key structural nonuniformity that must be 
dealt with in developing fast randomized matrix algorithms. Our main result is a randomized 
algorithm that takes as input an arbitrary n x d matrix A, with n ^ d, and that returns as 
output relative-error approximations to all n of the statistical leverage scores. The proposed 
algorithm runs (under assumptions on the precise values of n and d) in 0{nd\ogn) time, 
as opposed to the 0{nd^) time required by the naive algorithm that involves computing an 
orthogonal basis for the range of A. Our analysis may be viewed in terms of computing a 
relative-error approximation to an underconstrained least-squares approximation problem, or, 
relatedly, it may be viewed as an application of Johnson-Lindenstrauss type ideas. Several 
practically-important extensions of our basic result are also described, including the approxi- 
mation of so-called cross-leverage scores, the extension of these ideas to matrices with n ~ d, 
and the extension to streaming environments. 

1 Introduction 

The concept of statistical leverage measures the extent to which the singular vectors of a matrix 
are correlated with the standard basis and as such it has found usefulness recently in large-scale 
data analysis and in the analysis of randomized matrix algorithms |42l [301 [T9] . A related notion is 
that of matrix coherence, which has been of interest in recently popular problems such as matrix 
completion and Nystrom-based low-rank matrix approximation \i2\ W\\ . Defined more precisely 
below, the statistical leverage scores may be computed as the squared Euclidean norms of the 
rows of the matrix containing the top left singular vectors and the coherence of the matrix is the 
largest statistical leverage score. Statistical leverage scores have a long history in statistical data 
analysis, where they have been used for outlier detection in regression diagnostics [261 [13]. Statis- 
tical leverage scores have also proved crucial recently in the development of improved worst-case 
randomized matrix algorithms that are also amenable to high-quality numerical implementation 
and that are useful to domain scientists [m [30l [HI [I8l [40l [20] ; see [29] for a detailed discus- 
sion. The naive and best previously existing algorithm to compute these scores would compute 
an orthogonal basis for the dominant part of the spectrum of A, e.g., the basis provided by the 
Singular Value Decomposition (SVD) or a basis provided by a QR decomposition [23], and then 
use that basis to compute diagonal elements of the projection matrix onto the span of that basis. 
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We present a randomized algorithm to compute relative-error approximations to every statis- 
tical leverage score in time qualitatively faster than the time required to compute an orthogonal 
basis. For the case of an arbitrary n x d matrix A, with n ^ d, our main algorithm runs 
(under assumptions on the precise values of n and d, see Theorem [1] for an exact statement) 
in 0{ndlogn/e^) time, as opposed to the 0{nd'^) time required by the naive algorithm. As a 
corollary, our algorithm provides a relative-error approximation to the coherence of an arbitrary 
matrix in the same time. In addition, several practically-important extensions of the basic idea 
underlying our main algorithm are also described in this paper. 

1.1 Overview and definitions 

We start with the following definition of the statistical leverage scores of a matrix. 

Definition 1. Given an arbitrary n x d matrix A, with n > d, let U denote the n x d matrix 
consisting of the d left singular vectors of A, and let U(^i-^ denote the i-th row of the matrix U as 
a row vector. Then, the statistical leverage scores of the rows of A are given by 

ei=\M\l, (1) 

for i G {1, . . . ,n} ; the coherence 7 of the rows of A is 

7 = max £i, (2) 

«£{!,. ..,n} 

i.e., it is the largest statistical leverage score of A; and the (i, j)-cross-leverage scores are 

Cij = {U^i),U(^j)) , (3) 
i.e., they are the dot products between the i^^ row and the row of U . 

Although we have defined these quantities in terms of a particular basis, they clearly do not 
depend on that particular basis, but only on the space spanned by that basis. To see this, let Pa 
denote the projection matrix onto the span of the columns of A. Then, 

That is, the statistical leverage scores of a matrix A are equal to the diagonal elements of the 
projection matrix onto the span of its columns. Similarly, the (i, j)-cross-leverage scores are equal 
to the off-diagonal elements of this projection matrix, i.e., 

= (Pa)., = (5) 

Clearly, 0{nd^) time suffices to compute all the statistical leverage scores exactly: simply perform 
the SVD or compute a QR decomposition of A in order to obtain any orthogonal basis for the 
range of A and then compute the Euclidean norm of the rows of the resulting matrix. Thus, in 
this paper, we are interested in algorithms that run in oind'^) time. 

Several additional comments are worth making regarding this definition. First, since XliLi — 
||C/||^ = d, we can define a probability distribution over the rows of A as pi = tijd. As discussed 
below, these probabilities have played an important role in recent work on randomized matrix 
algorithms and an important algorithmic question is the degree to which they are uniform or 
nonuniform^ Second, one could also define leverage scores for the columns of a "tall" matrix 

^Observe that if U consists of d columns from the identity, then the leverage scores are extremely nonuniform: 
d of them are equal to one and the remainder are equal to zero. On the other hand, if U consists of d columns from 
a normalized Hadamard transform (see Section [2.31 for a definition), then the leverage scores are very uniform: all 
n of them are equal to djn. 
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A, but clearly those are all equal to one unless n < d or ^ is rank-deficient. Third, and more 
generally, given a rank parameter k, one can define the statistical leverage scores relative to the 
best rank-k approximation to A to be the n diagonal elements of the projection matrix onto the 
span of Afc, the best rank-A; approximation to A. 



1.2 Our main result 

Our main result is a randomized algorithm for computing relative-error approximations to every 
statistical leverage score, as well as an additive-error approximation to all of the large cross- 
leverage scores, of an arbitrary n x d matrix, with n 3> (i, in time qualitatively faster than the 
time required to compute an orthogonal basis for the range of that matrix. Our main algorithm 
for computing approximations to the statistical leverage scores (see Algorithm [T] in Section [3]) will 
amount to constructing a "randomized sketch" of the input matrix and then computing the Eu- 
clidean norms of the rows of that sketch. This sketch can also be used to compute approximations 
to the large cross- leverage scores (see Algorithm [2] of Section [3]) . 

The following theorem provides our main quality-of-approximation and running time result 
for Algorithm [TJ 

Theorem 1. Let A be a full-rank nxd matrix, with n ^ d; let e G (0, 1/2] be an error parameter; 
and recall the definition of the statistical leverage scores £i from DefinitionU^ Then, there exists a 
randomized algorithm (Algorithm[l\ of Section\^below) that returns values £i, for all i £ {1, . . . ,n} , 
such that with probability at least 0.8, 

' < (6) 

holds for all i G {1, . . . , n}. Assuming d <n < e'^ , the running time of the algorithm is 

0{nd\n {de~^) + nde-'^\nn + d^e'"^ {Inn) (in {de'^))) . 

Algorithm [1] provides a relative-error approximation to all of the statistical leverage scores £i 
of A and, assuming dlnd = o(j^). Inn = o{d), and treating e as a constant, its running 
time is o{nd'^), as desired. As a corollary, the largest leverage score (and thus the coherence) is 
approximated to relative-error in the o{nd^) time. 

The following theorem provides our main quality-of-approximation and running time result 
for Algorithm [21 

Theorem 2. Let A be a full-rank nxd matrix, with n ^ d; let e G (0, 1/2] be an error parameter; 
let K be a parameter; and recall the definition of the cross-leverage scores Cij from Definition [71 
Then, there exists a randomized algorithm (Algorithmic of Sectionl^ below) that returns the pairs 
{{i,j)} together with estimates {cij} such that, with probability at least 0.8, 

i- If c^j ^ — + 12e^j£j, then {i,j) is returned; if {i,j) is returned, then c]j > SOeiiij. 

K K 

a. For all pairs {i,j) that are returned, cfj — 30e£iij < cfj < cfj -\- 12e£i£j. 
This algorithm runs in 0{e~^nlnn -'r i^dln^ n) time. 

Note that by setting k = nlnn, we can compute all the "large" cross-leverage scores, i.e., those 
satisfying cfj > to within additive-error in O (ndln^n) time (treating e as a constant). If 

In^ n = o{d) the overall running time is o{nd^), as desired. 
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1.3 Significance and related work 

Our results are important for their applications to fast randomized matrix algorithms, as well as 
their applications in numerical linear algebra and large-scale data analysis more generally. 

Significance in tfieoretical computer science. The statistical leverage scores define 
the key structural nonuniformity that must be dealt with (i.e., either rapidly approximated 
or rapidly uniformized at the preprocessing step) in developing fast randomized algorithms 
for matrix problems such as least-squares regression [A0\ [20] and low-rank matrix approxima- 
tion \35\ HOl [191 [30l [TT] . Roughly, the best random sampling algorithms use these scores (or the 
generalized leverage scores relative to the best rank-A: approximation to A) as an importance 
sampling distribution to sample with respect to. On the other hand, the best random projection 
algorithms rotate to a basis where these scores are approximately uniform and thus in which 
uniform sampling is appropriate. See [29] for a detailed discussion. 

As an example, the CUR decomposition of [19', l30] essentially computes pi = ii/k, for all 
i G {1, . . . ,n} and for a rank parameter k, and it uses these as an importance sampling distri- 
bution. The computational bottleneck for these and related random sampling algorithms is the 
computation of the importance sampling probabilities. On the other hand, the computational 
bottleneck for random projection algorithms is the application of the random projection, which is 
sped up by using variants of the Fast Johnson-Lindenstrauss Transform [21 [3]. By our main result, 
the leverage scores (and thus these probabilities) can be approximated in time that depends on 
an application of a Fast Johnson-Lindenstrauss Transform. In particular, the random sampling 
algorithms of |18[ [T9l [30] for least-squares approximation and low-rank matrix approximation now 
run in time that is essentially the same as the best corresponding random projection algorithm 
for those problems |40] . 

Applications to numerical linear algebra. Recently, high-quality numerical implemen- 
tations of variants of the basic randomized matrix algorithms have proven superior to traditional 
deterministic algorithms |39 [ l38 [ l6]. An important question raised by our main results is how these 
will compare with an implementation of our main algorithm. More generally, density functional 
theory [8j and uncertainty quantification [7] are two scientific computing areas where computing 
the diagonal elements of functions (such as a projection or inverse) of very large input matrices 
is common. For example, in the former case, "heuristic" methods based on using Chebychev 
polynomials have been used in numerical linear algebra to compute the diagonal elements of the 
projector [8]. Our main algorithm should have implications in both of these areas. 

Applications in large-scale data analysis. The statistical leverage scores and the scores 
relative to the best rank-A: approximation to A are equal to the diagonal elements of the so- 
called "hat matrix" [261 114] . As such, they have a natural statistical interpretation in terms 
of the "leverage" or "influence" associated with each of the data points |26[ [131 114] . In the 
context of regression problems, the i*'' leverage score quantifies the leverage or infiuence of the 
j^th constraint/row of A on the solution of the overconstrained least squares optimization problem 
min^^ \\Ax — b\\2 and the (i, j)-th cross leverage score quantifies how much infiuence or leverage the 
i*'^ data point has on the j*'^ least-squares fit (see [MKEKH] for details). When applied to low- 
rank matrix approximation problems, the leverage score pj quantifies the amount of leverage or 
influence exerted by the j*'^ column of A on its optimal low-rank approximation. Historically, these 
quantities have been widely-used for outlier identification in diagnostic regression analysis [42l [15] . 

More recently, these scores (usually the largest scores) often have an interpretation in terms of 
the data and processes generating the data that can be exploited. For example, depending on the 
setting, they can have an interpretation in terms of high-degree nodes in data graphs, very small 
clusters in noisy data, coherence of information, articulation points between clusters, the value of 
a customer in a network, space localization in sensor networks, etc. [H [371 [Ml [271 [29] . I^i genetics. 
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dense matrices of size thousands by hundreds of thousands (a size scale at which even traditional 
deterministic QR algorithms fail to run) constructed from DNA Single Nucleotide Polymorphisms 
(SNP) data are increasingly common, and the statistical leverage scores can correlate strongly 
with other metrics of genetic interest \36\ I30j . Our main result will permit the computation of 
these scores and related quantities for significantly larger SNP data sets than has been possible 
previously [Ml US] ■ 

Remark. Lest there be any confusion, we should emphasize our main contributions. First, 
note that statistical leverage and matrix coherence are important concepts in statistics and ma- 
chine learning. Second, recall that several random sampling algorithms for ubiquitous matrix 
problems such as least-squares approximation and low-rank matrix approximation use leverage 
scores in a crucial manner; but until now these algorithms were il.{TsvD), where TsvD is the time 
required to compute a QR decomposition or a partial SVD of the input matrix. Third, note that, 
in some cases, o{Tsvd) algorithms exist for these problems based on fast random projections. 
But recall that the existence of those projection algorithms in no way implies that it is easy or 
obvious how to compute the statistical leverage scores efficiently. Fourth, one implication of our 
main result is that those random sampling algorithms can now be performed just as efficiently 
as those random projection algorithms; thus, the solution for those matrix problems can now be 
obtained while preserving the identity of the rows. That is, these problems can now be solved 
just as efficiently by using actual rows, rather than the arbitrary linear combinations of rows that 
are returned by random projections. Fifth, we provide a generalization to "fat" matrices and 
to obtaining the cross-leverage scores. Sixth, we develop algorithms that can compute leverage 
scores and related statistics even in streaming environments. 

1.4 Outline 

In Section [2l we will provide a brief review of relevant notation and concepts from linear algebra. 
Then, in Sections [3] and 01 we will present our main results: Section [3] will contain our main 
algorithm and Section 2] will contain the proof of our main theorem. Section [5] will then describe 
extensions of our main result to general "fat" matrices, i.e., those with n ~ d. Section [6] will 
conclude by describing the relationship of our main result with another related estimator for the 
statistical leverage scores, an application of our main algorithm to the under-constrained least- 
squares approximation problem, and extensions of our main algorithm to streaming environments. 

2 Preliminaries on linear algebra and fast random projections 

2.1 Basic linear algebra and notation 

Let [n] denote the set of integers {1, 2, . . . , n}. For any matrix A £ R"^'^, let ^(j), i G [n], denote 
the i-th row of j4 as a row vector, and let A^^\ j G [d] denote the j-th column of A as a column 
vector. Let ||A||^ = Y^^=iYl'j=i ^fj denote the square of the Frobenius norm of A, and let 
11^112 = sup |ja;||2=i 11^^112 denote the spectral norm of A. Relatedly, for any vector x G M", its 
Euclidean norm (or ^2-iiorm) is the square root of the sum of the squares of its elements. The dot 
product between two vectors x,y £ will be denoted {x,y), or alternatively as x^y. Finally, 
let Cj G M", for all i G [n], denote the standard basis vectors for R"- and let In denote the n x n 
identity matrix. 

Let the rank of A he p < min{n, d}, in which case the SVD of A is denoted hy A = U'EV'^, 
where U G R"-^^, S G R''^^, and V G R'^^P. (For a general matrix X, we wih write X = 
Ux'^x^x ■) G [p] denote the i-th singular value of A, and let o"max(^) and o"min(^) 
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denote the maximum and minimum singular values of A, respectively. The Moore-Penrose pseu- 
doinverse of A is the d x n matrix defined by A^ = VT,~^U'^ [33]. Finally, for any orthogonal 
matrix U S M"^^, let C/"*" G denote an orthogonal matrix whose columns are an or- 

thonormal basis spanning the subspace of that is orthogonal to the subspace spanned by the 
columns of U {i.e., the range of U). It is always possible to extend an orthogonal matrix [/ to a 
full orthonormal basis of as [U U-^]. 

The SVD is important for a number of reasons [23]. For example, the projection of the 
columns of A onto the k left singular vectors associated with the top k singular values gives the 
best rank-A; approximation to A in the spectral and Frobenius norms. Relatedly, the solution to 
the least-squares (LS) approximation problem is provided by the SVD: given an n x d matrix 
A and an n-vector 6, the LS problem is to compute the minimum ^2-iiorm vector x such that 
ll^lx — 6II2 is minimized over all vectors x € M'^. This optimal vector is given by x^pt = A^b. 
We call a LS problem overconstrained (or overdetermined) if n > d and underconstrained (or 
underdetermined) \i n < d. 

2.2 The Fast Johnson-Lindenstrauss Transform (FJLT) 

Given e > and a set of points xi,. . . ,Xn with Xi G W^, a e- Johnson-Lindenstrauss Transform 
(e-JLT), denoted H S M^'^'^, is a projection of the points into W such that 

(l-e)||x,||^<||nx,||^<(l + e)||x,||^. (7) 

To construct an e-JLT with high probability, simply choose every entry of H independently, equal 
to ±Y^3/r with probability 1/6 each and zero otherwise (with probability 2/3) \\\. Let TijLT be 
a matrix drawn from such a distribution over r x d matricesJl Then, the following lemma holds. 

Lemma 1 (Theorem 1.1 of [1]). Let xi,...,Xn be an arbitrary (but fixed) set of points, where 
Xi £ and let < e < 1/2 be an accuracy parameter. If 

r > -77 (121nn + 61n-- 
e"^ \ 

then, with probability at least 1 — 5, Hjlt £ W^'^ is an e-JLT . 

For our main results, we will also need a stronger requirement than the simple e-JLT and so 
we will use a version of the Fast Johnson-Lindenstrauss Transform (FJLT), which was originally 
introduced in [21 [3]. Consider an orthogonal matrix U E M"^"^, viewed as d vectors in M". A 
FJLT projects the vectors from M" to MJ" , while preserving the orthogonality of U; moreover, it 
does so very quickly. Specifically, given e > 0, H € M^^" is an e-FJLT for U if 

• \\ld - U^n^UUW^ <e, and 

• given any X the matrix product UX can be computed in 0{nd\nr) time. 

The next lemma follows from the definition of an e-FJLT and it's proof can be found in [18^ I20j. 

Lemma 2. Let A by any matrix in M"^'^ with n <^ d and rank(^) = d. Let the SVD of A be 
A = UJ:V^, let n be an e-FJLT for U (with < e < 1/2) and let ^ = UU = U^T.,$V^ . Then, 
all the following hold: 

rank(nA) = rank(nf/) = rank([/) = rank(A) = d, (8) 

< e/(l-e), and (9) 
(UA)^ = V^-\UU)^. (10) 



^When no confusion can arise, we will use 11 jlt to refer to this distribution over matrices as well as to a specific 
matrix drawn from this distribution. 
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2.3 The Subsampled Randomized Hadamard Transform (SRHT) 

One can use a Randomized Hadamard Transform (RHT) to construct, with high probabiHty, an 
e-FJLT. Our main algorithm will use this efficient construction in a crucial wayo Recall that the 
(unnormalized) n x n matrix of the Hadamard transform Hn is defined recursively by 



2n 



Hn —Hn 



with Hi = 1. The n x n normalized matrix of the Hadamard transform is equal to 

Hn = Hnl^Jn. 



From now on, for simplicity and without loss of generality, we assume that n is a power of 2 
and we will suppress n and just write H. (Variants of this basic construction that relax this 
assumption and that are more appropriate for numerical implementation have been described 
and evaluated in [6].) Let D £ M"^" be a random diagonal matrix with independent diagonal 
entries Da = +1 with probability 1/2 and Da = —1 with probability 1/2. The product HD 
is a RHT and it has three useful properties. First, when applied to a vector, it "spreads out" 
its energy. Second, computing the product HDx for any vector x G R" takes 0{nlog2n) time. 
Third, if we only need to access r elements in the transformed vector, then those r elements can 
be computed in 0(n log2 r) time [1]. The Subsampled Randomized Hadamard Transform (SRHT) 
randomly samples (according to the uniform distribution) a set of r rows of a RHT. 

Using the sampling matrix formalism described previously \17\ [T8| \T9\ I20j . we will represent 
the operation of randomly sampling r rows of an n x d matrix A using an r x n linear sampling 
operator S*^. Let the matrix Ilpjix = HD be generated using the SRHT@ The most important 
property about the distribution Hfjlt is that if r is large enough, then, with high probability, 
n_F jLT generates an e-FJLT. We summarize this discussion in the following lemma (which is 
essentially a combination of Lemmas 3 and 4 from [20j, restated to fit our notation). 

Lemma 3. Let lipjix £ he generated using the SRHT as described above and let U € M"^'^ 

(n^ d) be an (arbitrary but fixed) orthogonal matrix. If 

^ ^ 142dln(40n(i) f 30"^ dln{AOnd) 



e2 V e2 



then, with probability at least 0.9, Hfjlt is an e-FJLT for U. 



3 Our main algorithmic results 

In this section, we will describe our main results for computing relative-error approximations to 
every statistical leverage score (see Algorithm [T]) as well as additive-error approximations to all 
of the large cross- lever age scores (see Algorithm [2]) of an arbitrary matrix A S M"^*^, with d. 
Both algorithms make use of a "randomized sketch" of A of the form 74(niA)^n2, where Hi is 
an e-FJLT and n2 is an e-JLT. We start with a high-level description of the basic ideas. 

''Note that the RHT has also been crucial in the development of o{nd?) randomized algorithms for the general 
overconstrained LS problem [20] and its variants have been used to provide high-quality numerical implementations 
of such randomized algorithms |39l IS]. 

^ Again, when no confusion can arise, we will use Hfjlt to denote a specific SRHT or the distribution on 
matrices implied by the randomized process for constructing an SRHT. 
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3.1 Outline of our basic approach 

Recall that our first goal is to approximate, for all i £ [n], the quantities 



\eJU 



(11) 



where Cj is a standard basis vector. The hard part of computing the scores £j according to 
Eqn. (jlip is computing an orthogonal matrix U spanning the range of A, which takes 0{nd?) 
time. Since UU^ = AA\ it follows that 



zTaa^ 



(AA^) 



(12) 



where the first equality follows from the orthogonality of (the columns of) U. The hard part of 
computing the scores ii according to Eqn. ([12]) is two-fold: first, computing the pseudoinverse; 
and second, performing the matrix-matrix multiplication of A and A^ . Both of these procedures 
take 0{nd?) time. As we will see, we can get around both of these bottlenecks by the judicious 
application of random projections to Eqn. (|12p . 

To get around the bottleneck of 0{n(P) time due to computing A^ in Eqn. p2|). we will 
compute the pseudoinverse of a "smaller" matrix that approximates A. A necessary condition 
for such a smaller matrix is that it preserves rank. So, nai've ideas such as uniformly sampling 
ri <C rows from A and computing the pseudoinverse of this sampled matrix will not work well 
for an arbitrary A. For example, this idea will fail (with high probability) to return a meaningful 
approximation for matrices consisting of n — 1 identical rows and a single row with a nonzero 
component in the direction perpendicular to that the identical rows; finding that "outlying" row 
is crucial to obtaining a relative-error approximation. This is where the SRHT enters, since it 
preserves important structures of A, in particular its rank, by first rotating yl to a random basis 
and then uniformly sampling rows from the rotated matrix (see [20j for more details). More 
formally, recah that the SVD of A is UT.V'^ and let Ui G M^i^" be an e-FJLT for U (using, for 
example, the SRHT of Lemma[3]with the appropriate choice for ri). Then, one could approximate 
the £i's of Eqn. 1^ by 



ejA{U,A) 



(13) 



where we approximated the n x d matrix A by the ri x d matrix UiA. Computing A (IliA)^ in 
this way takes O (ndri) time, which is not efficient because ri > d (from Lemma [3]). 

To get around this bottleneck, recall that we only need the Euclidean norms of the rows of 



the matrix A (ni^)T G M"^*^!. Thus, we can further reduce the dimensionality of this matrix by 
using an e-JLT to reduce the dimension ri = 0(d) to r2 = O(lnn). Specifically, let IlJ G M^'^^^^ 
be an e-JLT for the rows of ^(Hi^d)^ (viewed as n vectors in R^'^) and consider the matrix 
Q = A (UiA)^ 112. This n x r2 matrix Q, may be viewed as our "randomized sketch" of the rows 
of AA^ . Then, we can compute and return 



ejAiUiA)^ n2 



(14) 



for each i G [n], which is essentially what Algorithm [T] does. Not surprisingly, the sketch 
can be used in other ways: for example, by considering the dot product between 
two different rows of this randomized sketching matrix (and some additional manipulations) Al- 
gorithm [2] approximates the large cross-leverage scores of A. 
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Input: A G (with SVD A = UT,V^), error parameter e G (0, 1/2]. 

Output: G [n]. 

1. Construct Hi G M^^^" to be an e-FJLT for U, using Lemma [3] with 

/ dlnn /dlnn 
ri = U [ — 5— In 



g2 g2 



2. Compute Hi A G W^""^ and its SVD, Hi A = Un^A^n^AV^^^. Let 

(Alternatively, R could be computed by a QR factorization of Hi A.) 

3. View the normalized rows of AR^^ G M"^'^ as n vectors in M'^, and construct 

II2 G R'^^'~2 to be an e-JLT for vectors (the aforementioned n vectors and their 
— n pairwise sums), using Lemma [U with 

r2 = O (e~^ Inn) . 

4. Construct the matrix product VL = AR^^Ii2- 

~ II 1 1 2 

5. For all i G [n] compute and return ii = 



Algorithm 1: Approximating the (diagonal) statistical leverage scores Hi. 
3.2 Approximating all the statistical leverage scores 

Our first main result is Algorithm [H which takes as input an n x d matrix A and an error 
parameter e G (0,1/2], and returns as output numbers li, i G [n]. Although the basic idea to 
approximate || (^^^)(j) || was described in the previous section, we can improve the efficiency of 
our approach by avoiding the full sketch of the pseudoinverse. In particular, let A = UiA and 
let its SVD be 1 = U 2T. ^Vj . Let R~^ = ViT.~^ and note that i?"^ G M'''"^ is an orthogonalizer 

AAA A A ° 

for A since = AR~^ is an orthogonal matrix^ In addition, note that AR~^ is approximately 
orthogonal. Thus, we can compute AR~^ and use it as an approximate orthogonal basis for A 
and then compute £{ as the squared row- norms of AR^^. The next lemma states that this is 
exactly what our main algorithm does; even more, we could get the same estimates by using any 
"orthogonalizer" of IIi^. 

Lemma 4. Let R^^ be such that Q = UiAR^^ is an orthogonal matrix with rank(Q) = 
rank(ni^). Then, \\{AR''^)^i^\\l = li. 

Proof. Since A = UiA has rank d (by Lemma [2]) and R~^ preserves this rank, R~^ is a d x d 
invertible matrix. Using A = QR and properties of the pseudoinverse, we get [A] = R^^Q^ . 



''This preprocessing is reminiscent of how [391 16] preprocessed the input to provide numerical implementations 
of the fast relative-error algorithm [2D] for approximate LS approximation. From this perspective, Algorithm [T] can 
be viewed as specifying a particular basis Q, i.e., as choosing Q to be the left singular vectors of 111^4. 
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Thus, 



□ 



3.3 Approximating the large cross-leverage scores 

By combining Lemmas [6] and [7] (in Section Hr2] below) with the triangle inequaUty, one immediately 
obtains the following lemma. 

Lemma 5. Let Q be either the sketching matrix constructed by AlgorithmUl i.e., O = AR^^Il2, 
or $7 = y4 (riiyl)^ 112 as described in Section \3.1[ Then, the pairwise dot-products of the rows of 
O are additive- error approximations to the leverage scores and cross-leverage scores: 



3e 



That is, if one were interested in obtaining an approximation to all the cross-leverage scores to 
within additive error (and thus the diagonal statistical leverage scores to relative-error), then the 
algorithm which first computes Q followed by all the pairwise inner products achieves this in time 
T(ri) + O (n^r2), where T(0) is the time to compute Q from Section [3.21 and r2 = 0(e~^lnn)l2| 
The challenge is to avoid the computational complexity and this can be done if one is interested 
only in the large cross- lever age scores. 

Our second main result is provided by Algorithms [2] and [3l Algorithm [2] takes as input an 
n X d matrix A, a parameter k > 1, and an error parameter e G (0, 1/2], and returns as output 
a subset of [n] x [n] and estimates Cij satisfying Theorem [2l The first step of the algorithm is to 
compute the matrix Q = AR~^Il2 constructed by Algorithm [TJ Then, the algorithm Algorithm [3] 
as a subroutine to compute "heavy hitter" pairs of rows from a matrix. 



Input: A G 



nnxd 



and parameters k > 1, e G (0, 1/2]. 



Output: The set H consisting of pairs together with estimates Cij satisfying 
Theorem [2j 

1. Compute the n x r2 matrix Q, = AR^^Il2 from Algorithm [TJ 

2. Use Algorithm [3] with inputs Q and k' = k{1 + 30de) to obtain the set 7i 
containing all the At'-heavy pairs of O. 

3. Return the pairs in H as the K-heavy pairs of A. 



Algorithm 2: Approximating the large (off-diagonal) cross-leverage scores Cj 



®The exact algorithm which computes a basis first and then the pairwise inner products requires 0{nd? + n^d) 
time. Thus, by using the sketch, we can already improve on this running time by a factor of d/Inn. 
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Input: X € W^^"^ with rows xi, . . . , x„ and a parameter k > 1. 

Output: % = {(i,j),c.tj} containing all heavy (unordered) pairs. The pair 
{hj)-,Cij G ^ if and only if cfj = (xi.Xjf > \\X'^X\\p/K. 
1: Compute the norms ||xi||2 and sort the rows according to norm, so that 

^1 II 2 — ' ' ' — ll-^n II 2 ■ 

2: H <r- {}; zi ^ n; Z2 <r- 1. 

3: while Z2 < zi do 

4: while ||a^2i 112113^22 II2 < 

5: Z2 ^ 2:2 + 1- 

6: if Z2 > z\ then 
7: return Ji. 

8: end if 

9: end while 

10: for each pair (i, j) where i = z\ and j G {^2, ^2 + Ij • • • > z\\ do 

11: c\ = {x,,x^)'^. 

12: if cf^. > 1 1 X'^^ 1 1 5, then 

13: add and Cjj to "H. 

14: end if 

15: Z\ Z\ — 1. 

16: end for 

17: end while 

18: return H.. 



Algorithm 3: Computing heavy pairs of a matrix. 

4 Proofs of our main theorems 

4.1 Sketch of the proof of Theorems [1] and [2] 

We will start by providing a sketch of the proof of Theorems [1] and [2l A detailed proof is provided 
in the next two subsections. We condition all the analysis on the events that Hi S is an 

e-FJLT for U and Vi2 G W^^'"' is an e-JLT for r? points in W'^ . Note that by setting b = 0.1 in 
Lemma[Tl both events hold with probability at least 0.8, which is equal to the success probability 
of Theorems [1] and [2j The algorithm estimates ii = ||ui||2) where Ui = ej A(n.iA)^Il2. First, 
observe that the sole purpose of 112 is to improve the running time while preserving pairwise inner 
products; this is achieved because 112 is an e-JLT for points. So, the results will follow if 

efA{UiA)^{UiA)^fA'^ej « efUU^ej 

and (ni^)^^ can be computed efficiently. Since Hi is an e-FJLT for U, where A = U'EV'^ , (HiA)^ 
can be computed in 0{ndlnri + ricF) time. By Lemma[2l (IIiA)^ = VT,~^(IliU)^ , and so 

efA{UiA)\{UiA)^fA^ej = efU{UiU)\UiU)^^U^ej. 

Since Hi is an e-FJLT for U, it follows that {UiU)'' (IliU)'' ^ Id, i.e., that IliU is approximately 
orthogonal. Theorem [1] follows from this basic idea. However, in order to prove Theorem^ having 
a sketch which preserves inner products alone is not sufficient. We also need a fast algorithm to 
identify the large inner products and to relate these to the actual cross-leverage scores. Indeed, it 
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is possible to efficiently find pairs of rows in a general matrix with large inner products. Combining 
this with the fact that the inner products are preserved, we obtain Theorem [2j 



4.2 Proof of Theorem [T] 



We condition all our analysis on the events that Hi G M*"!^" is an e-FJLT for U and 112 S M''i^^2 
is an e-JLT for points in W"^ . Define 



Ui 
Ui 



efA{UiA)\ and 



Then, ii = \\ui\\2 and ii = Wu 
Lemma 6. For i,j G [n], 



■« 1 1 2 • 



The proof will follow from the following two lemmas. 



- {ui,Uj) 



< 



1 - e 



Lemma 7. For i,j G [n], 



\{ui,Uj) - {ui,Uj)\ < 2e||ni||2||nj||2- 



(15) 



(16) 



Lemma [6] states that {ui,Uj) is an additive error approximation to all the cross-leverage scores 
(i 7^ j) and a relative error approximation for the diagonals {i = j). Similarly, Lemma [7] shows 
that these cross- leverage scores are preserved by 112. Indeed, with i = j, from Lemma [6] we have 
~ ^ T^'^i' ^^'^ from Lemma [7] we have \£i — £i\ < 2eii. Using the triangle inequality and 
e<l/2: 



< 



£i £% 



+ 



P- — £■ 



< 



1 - e 



+ 2e]i^< Aeii. 



The theorem follows after rescaling e. 



Proof of Lemma O Let A = U'EV'^ . Using this SVD of A and Eqn. ([TO]) in Lemma O 

{ui,Uj) = efU^V^V^-'^{UiU)^niU)^'^^-W^V^U^ej = ef U {UiU)^UiU)^'^ ej . 
By performing standard manipulations, we can now bound ^(j)) ~ 

\{U^^),U^,)) - {u^,uj)\ = eJUU^ej-eJUiUiU)Hn,U)^'^U^ej 



< 



rm2 Irimh 



Let the SVD of ^ = HiU be \I' = C/ij-S^fV^, where Vq, is a full rotation in d dimensions (because 
rank(A) = rank(ni[/)). Then, ^-t^t^ = l^SjVJ. Thus, 



\{U(^^),U^J)) - {u^,Uj)\ < ||/d-y*SjVJ||2 ||C/(i)| 



2 Wum 



l^cJ- ^*l2 II^WII2 Il^{i)ll2^ 



I2 Wum 



where we used the fact that Vq,V^ = V^Vx^/ = 1^ and the unitary invariance of the spectral norm. 
Finally, using Eqn. ([9]) of Lemma [2] the result follows. 
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Proof of Lemma O Since 112 is an e-JLT for vectors, it preserves tlie norms of an arbitrary 
(but fixed) collection of m? vectors. Let Xi = Ui/\\ui\\2- Consider the following vectors: 

Xi for i G [n], and 
Xi + Xj for i,j G [n],i / j. 

By the e-JLT property of Ii2 ^-i^d the fact that ||xj||2 = 1, 

1 — e < ||xin2||2 < 1 + e for i G [n], and (17) 

(1 - e)\\xi + Xj\\\ < \\xiU2 + Xjn2||2 < (1 + (■)\\xi + XjWl for ij £ [n],i / j. (18) 

Combining Eqns. (jl7p and (jlSp after expanding the squares using the identity ||a + 6||^ = ||a||^ + 
+ 2{a,b), substituting ||xj|| = 1, and after some algebra, we obtain 

{xi,Xj) - 2e < {xiU2,XjU2) < {xi,Xj) + 2e. 

To conclude the proof, multiply throughout by ||nj||||iij|| and use the homogeneity of the inner 
product, together with the linearity of 112, to obtain: 

(-Uj, Uj) — 2e||'Ui|| llujll < (^4112, 'Ujn2) < (uj, -Uj) + 2e||-Ui|| llujll- 

Running Times. By Lemma [H we can use T^iiA^^n^ instead of (HiA)^ and obtain the same 
estimates. Since Hi is an e-FJLT, the product HiA can be computed in 0{ndlnri) while its 
SVD takes an additional 0{ri(f) time to return VuiA^^l ^ Since 112 G R'^''''^, we obtain 

VniA5^ni-'^2 £ M'^^'"^ in an additional 0{r2(P) time. Finally, premultiplying by A takes 0{ndr2) 
time, and computing and returning the squared row-norms of = AVniASn^n2 G M"^*"^ takes 
O (nr2) time. So, the total running time is the sum of all these operations, which is 

0{nd\iiri + ndr2 + rid^ + r2d^). 

For our implementations of the e-JLTs and e-FJLTs (5 = 0.1), ri = O (e^^d (Inn) (In (e^^dlnn))) 
and r2 = 0(e~^lnn). It follows that the asymptotic running time is 

0{ndln {de~^) + nde'"^ Inn + d^e'^ (Inn) (in (de^^))) • 

To simplify, suppose that d < n < e'^ and treat e as a constant. Then, the asymptotic running 
time is 

O [ndlnn + d^ (Inn) (Ind)) . 

4.3 Proof of Theorem [2] 

We first construct an algorithm to estimate the large inner products among the rows of an 
arbitrary matrix X G M"^*" with n > r. This general algorithm will be applied to the matrix 
Cl = AVsiiA^u^A^2- Let xi, . . . ,Xn denote the rows of X; for a given k > 1, the pair (i,j) is 
heavy if 

{xi,Xjf > -\\X^X\\i. 
By the Cauchy-Schwarz inequality, this implies that 

\\x^\\l\\xj\\l>-\\X^X\\l, (19) 

so it suffices to find all the pairs for which Eqn. (jl9p holds. We will call such pairs norm- 
heavy. Let s be the number of norm- heavy pairs satisfying Eqn. (jl9p . We first bound the number 
of such pairs. 
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Lemma 8. Using the above notation, s < nr. 
Proof. Observe that 




Ikiiykj II2 — [ y^ Iki|l2 I — 11-^1 




where di, . . . , o"r. are the singular values of X. To conclude, by the definition of a heavy pair, 

i,j i=l \i=l / 

where the last inequality follows by Cauchy-Schwarz. □ 

Algorithm[3]starts by computing the norms \\xi\\^ for all i E [n] and sorting them (in O {nr + n Inn) 
time) so that we can assume that ||a;i||2 < • • • < ||3;n||2- Then, we initialize the set of norm-heavy 
pairs to ^ = {} and we also initialize two pointers zi = n and Z2 = 1. The basic loop in the 
algorithm checks if Z2 > z\ and stops if that is the case. Otherwise, we increment Z2 to the first 
pair (21,22) that is norm-heavy. If none of pairs are norm heavy (i.e., Z2 > z\ occurs), then 
we stop and output "H; otherwise, we add (zi, 22); (^i, 22 + 1), • • • , (-21, ^i) to H. This basic loop 
computes all pairs (21, i) with i < z\ that are norm-heavy. Next, we decrease z\ by one and if 
z\ < Z2 we stop and output "H; otherwise, we repeat the basic loop. Note that in the basic loop 
Z2 is always incremented. This occurs whenever the pair (21,22) is not norm-heavy. Since 22 can 
be incremented at most n times, the number of times we check whether a pair is norm-heavy and 
fail is at most n. Every successful check results in the addition of at least one norm-heavy pair 
into Ti and thus the number of times we check if a pair is norm heavy (a constant-time operation) 
is at most n + s. The number of pair additions into 7i is exactly s and thus the total running 
time is 0{nr -|- nlnn -|- s). Finally, we must check each norm- heavy pair to verify whether or not 
it is actually heavy by computing s inner products vectors in R*"; this can be done in 0{sr) time. 
Using s < Kr we get the following lemma. 

Lemma 9. Algorithmic returns % including all the heavy pairs of X in 0{nr + Kr"^ + nlnn) 
time. 

To complete the proof, we apply Algorithm [3] with = AT4i^^Sj^j^^n2 S R"^''^, where r2 = 
O(e^^lnn). Let ui, . . . ,Un denote the rows of and recall that A = UT,V'^ . Let ui, . . . ,Un 
denote the rows of U ; then, from Lemma [5l 

{ui,Uj) - Y--^ll'"j|lll%ll < {ui,Uj) < {ui,Uj) + Y^^lluilllltijll. (20) 



Given e, k, assume that for the pair of vectors Uj and u 



■J 



/ \2 \ ^WttTttW'^ I 10 II l|2|| ||2 ^ , 10 II l|2|| ||2 

(Ui,Uj) > —\\U U \\ „ + 12e\\Ui\\ \\Uj\\ = — h 12e Uj \\Uj\\ , 

where the last equality follows from ||C/^C/|||, = = d. By Eqn. ([20|) . after squaring and 

using e < 0.5, 

(uj, Uj)^ — 12e||ui||^e||nj||^ < {ui,Uj)'^ < {ui,Uj)^ + 30e\\ui\\'^\\uj\\'^ , . (21) 
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Thus, {ui,Uj)'^ > d/K and summing Eqn. ([^T]) over all i,j we get ||r2'^il||^ < d + 30ed^, or, 
equivalently. 



d> 



1 + 30de 
We conclude that 

{u^,Ujf >- + l2e\\u^f\\Ujf =^ {Ui,Ujf >-> , ^jj J , • (22) 

K K k{1 + 30(ie) 

By construction. Algorithm [3] is invoked with k' = and thus it finds all pairs with 

{ui,Uj)'^ > ||r2'^0||^/K' = d/K. This set contains all pairs for which 

2 ^ ^ , io,ll„, .l|2|i , ||2 



{ui,Uj) > - + I2e\\ui 



K 

Further, since every pair returned satisfies {ui,Uj)'^ > d/K, by Eqn. ([2T]) . Cij > d/K — 30eii£j. This 
proves the first claim of the Theorem; the second claim follows analogously from Eqn. (j2ip . 

Using Lemma [H the running time of our approach is O {nr2 + k't"^ + n In n) . Since r2 = 
O (e^^lnn), and, by Eqn. (I22p . k' = Kjlri-'^illl^/fi < k{1 + 30(ie), the overall running time is 
O ie^'^n In n + e^^Kcf In^ n) . 



5 Extending our algorithm to general matrices 

In this section, we will describe an important extension of our main result, namely the compu- 
tation of the statistical leverage scores relative to the best rank-A; approximation to a general 
matrix A. More specifically, we consider the estimation of leverage scores for the case of general 
"fat" matrices, namely input matrices A G M"^"^, where both n and d are large, e.g., when d = n 
or d = G(n). Clearly, the leverage scores of any full rank n x n matrix are exactly uniform. The 
problem becomes interesting if one specifies a rank parameter k <^ min{n,fi}. This may arise 
when the numerical rank of A is small {e.g., in some scientific computing applications, more than 
99% of the spectral norm of A may be captured by some k <^ min{n,fi} directions), or, more 
generally, when one is interested in some low rank approximation to A {e.g., in some data anal- 
ysis applications, a reasonable fraction or even the majority of the Frobenius norm of A may be 
captured by some k <^ min{n, d} directions, where k is determined by some exogenously-specified 
model selection criterion). Thus, assume that in addition to a general n x d matrix A, a rank 
parameter k < min{n, d} is specified. In this case, we wish to obtain the statistical leverage 
scores £{ = ||(^fc)(i)||2 = UkTi^V/F , the best rank-A; approximation to A. Equivalently, we 

seek the normalized leverage scores 

P^ = f • (23) 

Note that J2i=iPi = 1 since Y17=i^i = ll^^fcllF = ^• 

Unfortunately, as stated, this is an ill-posed problem. Indeed, consider the degenerate case 
when A = In (i.e., the n x n identity matrix). In this case, Ut is not unique and the leverage 
scores are not well-defined. Moreover, for the obvious (^) equivalent choices for Uk, the leverage 
scores defined according to any one of these choices do not provide a relative error approximation 
to the leverage scores defined according to any other choices. More generally, removing this trivial 
degeneracy does not help. Consider the matrix 



0(1- -iVn- 



nnxn 
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In this example, the leverage scores for are well defined. However, as 7 — t- 0, it is not possible 
to distinguish between the top-A; singular space and its complement. This example suggests that it 
should be possible to obtain some result conditioning on the spectral gap at the k^^ singular value. 
For example, one might assume that cr| — cr^_(_-^ > 7 > 0, in which case the parameter 7 would play 
an important role in the ability to solve this problem. Any algorithm which cannot distinguish 
the singular values with an error less than 7 will confuse the k-th and {k + l)-th singular vectors 
and consequently will fail to get an accurate approximation to the leverage scores for A^. 

In the following, we take a more natural approach which leads to a clean problem formulation. 
To do so, recall that the leverage scores and the related normalized leverage scores of Eqn. (|23p are 
used to approximate the matrix in some way, e.g., we might be seeking a low-rank approximation 
to the matrix with respect to the spectral [19] or the Frobenius pJJ norm, or we might be seeking 
useful features or data points in downstream data analysis applications 1^36* i30j, or we might 
be seeking to develop high-quality numerical implementations of low-rank matrix approximation 
algorithms [23], etc. In all these cases, we only care that the estimated leverage scores are a good 
approximation to the leverage scores of some "good" low-rank approximation to A. The following 
definition captures the notion of a set of rank-fc matrices that are good approximations to A. 

Definition 2. Given A G W^^'^ and a rank parameter k <C mm{n,d}, let A^ be the best rank-k 
approximation to A. Define the set of rank-k matrices that are good approximations to A as 
follows (fori = 2,F): 

Se = eW"^ : rank(X) = A: and \\A - X\\^ < {I + e)\\A - A^W^^ (24) 

We are now ready to define our approximations to the normalized leverage scores of any matrix 
A G W^^'^ given a rank parameter k <^ min{n, d}. Instead of seeking to approximate the pi 
of Eqn. (j23p (a problem that is ill-posed as discussed above), we will be satisfied if we can 
approximate the normalized leverage scores of some matrix X & S^. This is an interesting 
relaxation of the task at hand: all matrices X that are sufficiently close to Ak are essentially 
equivalent, since they can be used instead of A^ in applications. 

Definition 3. Given A G M"^'' and a rank parameter k <^ min {n, d}, let be the set of matrices 
of Definition\M We call the numbers pi (for all i G [n]) P- approximations to the normalized 
leverage scores of Aj. (the best rank-k approximation to A) if, for some matrix X ^ S^, 



P\\{Ux)ii) 



|2 



Pi > ' ^ and y^^Pi 



Here Ux G M"^ is the matrix of the left singular vectors of X. 

Thus, we will seek algorithms whose output is a set of numbers, with the requirement that those 
numbers are good approximations to the normalized leverage scores of some matrix X £ 
(instead of ^fc). This removes the ill-posedness of the original problem. Next, we will give two 
examples of algorithms that compute such /3-approximations to the normalized leverage scores of 
a general matrix A with a rank parameter k for two popular norms, the spectral norm and the 
Frobenius norm. 

5.1 Leverage Scores for Spectral Norm Approximators 

Algorithm|3]approximates the statistical leverage scores of a general matrix A with rank parameter 
k in the spectral norm case. It takes as inputs a matrix A gR"^*^ with rank(A) = p and a rank 
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Input: A G M"'^'^ with rank(yl) = p and a rank parameter k <^ p 
Output: pi^i G [n] 

1. Construct 11 G M'^^^fc -^yj^j^ entries drawn in i.i.d. trials from the normal 
distribution 7V(0, 1). 

2. Compute B = {AA^Y AU G R"^2fc^ ^-^^ ^ -^^ ^^^^ ([26]) . 

3. Approximately compute the statistical leverage scores of the "tall" matrix B by 
calling Algorithm [1] with inputs B and e; let £i (for all i G [n]) be the outputs of 
Algorithm [TJ 

4. Return 

P» = — ^ (25) 

for all i £ \n\. 



Algorithm 4: Approximating the statistical leverage scores of a general matrix A (spectral 
norm case). 



parameter k <^ p, and outputs a set of numbers pi for all i £ [n], namely our approximations to 
the normalized leverage scores of A with rank parameter k. 

The next lemma argues that there exists a matrix X G M""^*^ of rank k that is sufficiently 
close to A (in particular, it is a member of with constant probability) and, additionally, can 
be written as X = BY, where Y G R^fexfc -g ^ matrix of rank k. A version of this lemma was 
essentially proven in [23], but see also [38j for computational details; we will use the version of 
the lemma that appeared in |10j . Note that for our purposes in this section, the computation of 
Y is not relevant and we defer the reader to \24:\ [TO] for details. 

Lemma 10 (Spectral Sketch). Given A G M"^'^ of rank p, a rank parameter k such that 2 < k < p, 
and an error parameter e such that < e < 1, /ei 11 G W^^"^^ be a standard Gaussian matrix (with 
entries selected in i.i.d. trials from Af{0, 1)). If B = (AA^) AH, where 



q> 



ln[l + ^j^ + e^l^ymm{n,d} -k 
21n(l + e/10) - 1/2 



(26) 



then there exists a matrix X G M"^^ of rank k satisfying X = BY (with Y G M^fexfej such that 



E[\\A-xy< (i + Y^j \\A-Ak\\,. 



The matrix B can be computed in O (ndkq) time. 



This version of the above lemma is proven in |10)rl Now, since X has rank k, it follows that 



\A-XL > \\A-A 



fe II2 



and thus we can consider the non-negative random variable ||j4 — X| 



'^More specifically, the proof may be found in Lemma 32 and in particular in Eqn. (14) in Section A. 2; note that 
for our purposes here we replaced e/V^ by e/10 after adjusting q accordingly. 



17 



|A — ^A;||2 and apply Markov's inequality to get that 



\A-X\ 



k\\2 



holds with probability at least 0.9. Thus, X G 5^ with probability at least 0.9. 

The next step of the proposed algorithm is to approximately compute the leverage scores of 
B € M"^2fc .^jg^ Algorithm[TJ Under the assumptions of TheoremlU this step runs in O [nke''^ In n) 
time. Let Ux S W^^^ be the matrix containing the left singular vectors of the matrix X of 
Lemma [lOl Then, since X = BY by Lemma [TOl it follows that 

Ub = [Ux Ur] 

is a basis for the subspace spanned by the columns of B. Here Ur G W^^^ is an orthogonal matrix 
whose columns are perpendicular to the columns of Ux- Now consider the approximate leverage 
scores li computed by Algorithm [1] and note that (by Theorem [1]), 



B, 



< e 



holds with probability at least 0.8 for all i £ [n]. It follows that 



E 



',<(l + e)j;||(f/£ 



[l + e)^\\UB\ 



2{l + e)k. 



Finally, 



El 



-.1 



En 
7 = 1 ■ 



> 



> 



> 



X, 



+ 



En 
.7 = 1 • 



X, 



En 
7 = 1 



1 



X, 



iUx)u 



2(l + e) k 

jk are the normalized leverage scores of the matrix X. Recall that X G 5^ 



Clearly, \,^x)(;,) 

with probability at least 0.9 and use Definition [3] to conclude that the scores %>{ of Eqn. ([25 
approximations to the normalized leverage scores of A with rank parameter k. The 



are 



.2(l+e) 

following Theorem summarizes the above discussion: 

pnxd 



Theorem 3. Given A G M"^ , a rank 'parameter k, and an accuracy parameter e, Algorithm^ 



computes a set of normalized leverage scores pi that are 



l-e 



-approximations to the normalized 



.2(1+^), 

leverage scores of A with rank parameter k with probability at least 0.7. The proposed algorithm 
runs in 



O ndk 



In (min{n, d}) 
In (1 + e) 



+ nke In n 



time. 
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Input: A G M"'^'^ with rank(yl) = p and a rank parameter k <^ p 
Output: pi^i £ [n] 

1. Let r be as in Eqn. (I28p and construct 11 G M.'^^'^ whose entries are drawn in i.i.d. 
trials from the normal distribution M{0, 1). 

2. Compute B = AU £ R"^^ 

3. Compute a matrix Q £ M"^^ whose columns form an orthonormal basis for the 
column space of B. 



4. Compute the matrix Q^A £ W^'^ and its left singular vectors UqTj^ £ 



prxd 



5. Let Uqtj^ i^ £ denote the top k left singular vectors of the matrix Q^A (the 
first k columns of UqTj^) and compute, for all i € [n], 



iQUQ-A,k)Jl- (27) 



6. Return pi = li/k for all i £ [n]. 



Algorithm 5: Approximating the statistical leverage scores of a general matrix A (Frobenius 
norm case). 

5.2 Leverage Scores for Frobenius Norm Approximators. 

Algorithm[5]approximates the statistical leverage scores of a general matrix A with rank parameter 
k in the Frobenius norm case. It takes as inputs a matrix A gR"^°' with rank(^) = p and a rank 
parameter k <^ p, and outputs a set of numbers pi for all i £ [n], namely our approximations to 
the normalized leverage scores of A with rank parameter k. It is worth noting that Y27=i ~ 

II 1 1 2 II 1 1 2 

II QC/gT^fc 11^ = ||t^QT/^fc||^ = k and thus the pi sum up to one. The next lemma argues that 
there exists a matrix X £ M"^'^ of rank k that is sufficiently close to A (in particular, it is a 
member of with constant probability). Unlike the previous section (the spectral norm case), 
we will now be able to provide a closed-form formula for this matrix X and, more importantly, 
the normalized leverage scores of X will be exactly equal to the pi returned by our algorithm. 
Thus, in the parlance of Definition [3l we will get a 1-approximation to the normalized leverage 
scores of A with rank parameter k. 

Lemma 11 (Frobenius Sketch). Given A £ M"^'^ of rank p, a rank parameter k such that 
2 < k < p, and an error parameter e such that < e < 1, let H £ M.'^^'^ be a standard Gaussian 
matrix (with entries selected in i.i.d. trials from M{0, 1) ) with 



r > k + 

Let B = Ali and let X he as in Eqn. i29\}. Then, 

E 

The matrix B can he computed in O {ndke^^^ time. 



10/c ■ 
+ 1 

e 



(28) 



I^-^IIf 



2 



< {l + -^]\\A-Ak\\F 
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Let 

X = Q{Q^A)j^eW"^, (29) 

where [Q'^ A^^ is the best rank-A; approximation to the matrix Q^A; from standard hnear algebra, 
[Q'^A)^ = UqTj^ i^Uqj.^ j^Q'^A. Then, the above lemma is proven in jlOj]^ Now, since X has rank 
k, it follows that — > ||^ — ylfc||?;. and thus we can consider the non-negative random 

1 1 2 1 1 2 

variable — — — A^H^ and apply Markov's inequality to get that 

\\A - X\\l - \\A - AkWl < 4^ - Ml 

holds with probability at least 0.9. Rearranging terms and taking square roots of both sides 
implies that 

\\A - X\\p < Vl + e\\A - AkWp < (1 + e) \\A - Ak\\p. 

Thus, X G with probability at least 0.9. To conclude our proof, recall that Q is an orthonormal 
basis for the columns of B. From Eqn. (I29p . 

X = Q {Q'^A)^ = QUQTA^kUgTj^^f.Q'^A = QUQTA,k^QTA,kyQTA,k- 

In the above, Sqt^ ^ G M*^^'^ is the diagonal matrix containing the top k singular values of Q^A 
and Vqta k ^ ^^^'^ is the matrix whose rows are the top k right singular vectors of Q^A. Thus, 
the left singular vectors of the matrix X are exactly equal to the columns of the orthogonal matrix 
QUgTA^k't it now follows that the ii of Eqn. (|27p are the leverage scores of the matrix X and, 
finally, that the pi returned by the proposed algorithm are the normalized leverage scores of the 
matrix X. 

We briefly discuss the running time of the proposed algorithm. First, we can compute B in 
0{ndr) time. Then, the computation of Q takes 0{nr'^) time. The computation of A takes 
0{ndr) time and the computation of Uqt^ takes 0{dr'^) time. Thus, the total time is equal to 
O {ndr + (n + (i)r^) . The following Theorem summarizes the above discussion. 

Theorem 4. Given A E M"^'^, a rank parameter k, and an accuracy parameter e, Algorithmic 
computes a set of normalized leverage scores pi that are 1- approximations to the normalized lever- 
age scores of A with rank parameter k with probability at least 0.7. The proposed algorithm runs 
in O [ndke^^ + (n + (i)A;^e~^) time. 



6 Discussion 



We will conclude with a discussion of our main results in a broader context: understanding the 
relationship between our main algorithm and a related estimator for the statistical leverage scores; 
applying our main algorithm to solve under-constrained least squares problems; and implementing 
variants of the basic algorithm in streaming environments. 



6.1 A related estimator for the leverage scores 

Magdon-Ismail in [28j presented the following algorithm to estimate the statistical leverage scores: 
given as input an n x c? matrix A, with d, the algorithm proceeds as follows. 

• Compute n^, where the O (j^) x n matrix 11 is a SRHT or another FJLT. 



More specifically, the proof may be found in Lemma 33 in Section A. 3; note that for our purposes here we set 
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• Compute X = (n^)tn. 

• For t = 1, . . . , n, compute the estimate wt = A^^X^*) and set wt = max | |. 

• Return the quantities Pi = Wi/ ui,/, for i G [n]. 

|28| argued that the output pi achieves an O(ln^n) approximation to all of the (normalized) 
statistical leverage scores of A in roughly 0{ncP' / \nn) time. (To our knowledge, prior to our 
work here, this is the only known estimator that obtains any nontrivial provable approximation 
to the leverage scores of a matrix in o{nd?) time.) To see the relationship between this estimator 
and our main result, recall that 

ii = eJUU^a = ejAA^Ci = xfyi, 

where the vector xj = ej A is cheap to compute and the vector yi = A^ei is expensive to 
compute. The above algorithm effectively approximates yi = A^Ci via a random projection 
as iji = (n^)^^nej, where 11 is a SRHT or another FJLT. Since the estimates xjiji are not 
necessarily positive, a truncation at the negative tail, followed by a renormalization step, must be 
performed in order to arrive at the final estimator returned by the algorithm. This truncation- 
renormalization step has the effect of inflating the estimates of the small leverage scores by an 
O(ln'^n) factor. By way of comparison, Algorithm [1] essentially computes a sketch of AA'^ of the 
form ^(n^)^^!!"^ that maintains positivity for each of the row norm estimates. 

Although both Algorithm [1] and the algorithm of this subsection estimate AA^ by a matrix 
of the form ^(Ilyl)^!!"^, there are notable differences. The algorithm of this subsection does not 
actually compute or approximate AA'^ directly; instead, it separates the matrix into two parts 
and computes the dot product between efA and (HA) '''116,. Positivity is sacrificed and this leads 
to some complications in the estimator; however, the truncation step in interesting, since, despite 
the fact that the estimates are "biased" (in a manner somewhat akin to what is obtained with 
"thresholding" or "regularization" procedures), we still obtain provable approximation guaran- 
tees. The algorithm of this subsection is simpler (since it uses an application of only one random 
projection), albeit at the cost of weaker theoretical guarantees and a worse running time that 
our main algorithm. A direction of considerable practical interest is to evaluate empirically the 
performance of these two estimators, either for estimating all the leverage scores or (more inter- 
estingly) for estimating the largest leverage scores for data matrices for which the leverage scores 
are quite nonuniform. 

6.2 An application to under-constrained least-squares problems 

Consider the following under-constrained least-squares problem: 

min \\Ax-b\\o, (30) 

where A G M"^'^ has much fewer rows than columns, i.e., n ^ d. It is well-known that we can 
solve this problem exactly in 0{'n?d) time and that the minimal ^2-iiorm solution is given by 
Xopt = A'^h. For simplicity, let's assume that the input matrix A has full rank (i.e., rank(j4) = n) 
and thus ||vlxopt — hW^ = 0. 

In this section, we will argue that Algorithm [6] computes a simple, accurate estimator Xopt for 
Xopt- In words. Algorithm [6] samples a small number of columns from A (note that the columns 
of A correspond to variables in our under-constrained problem) and uses the sampled columns 
to compute Xopt- However, in order to determine which columns will be included in the sample. 
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the algorithm will make use of the statistical leverage scores of the matrix A^; more specifically, 
columns (and thus variables) will be chosen with probability proportional to the corresponding 
statistical leverage score. We will state Algorithm [6] assuming that these probabilities are parts 
of the input; the following theorem is our main quality-of- approximation result for Algorithm [6l 

Theorem 5. Let A G M"^'^ be a full-rank matrix with n <C d; let e G (0,0.5] be an accuracy 
parameter; let 5 G (0,1) be a failure probability; and let Xopt = A^b be the minimal £2-norm 
solution to the least-squares problem of Eqn. [3U\) . Let pi > 0, i G [d], be a set of probabilities 
satisfying Yli=i Pi — ^ ^'^'^ 



P. 



n 



(31) 



for some constant j3 G (0, 1]. (Here V G M is the matrix of the right singular vectors of A.) If 
Xopt is computed via Algorithmic then, with probability at least 1 — 5, 

\\Xopt ^opt II 2 — ll'^opt|l2 • 

Algorithmic runs in O (n^e^^/S^-*^ In (n/e/35) + nd) time. 

Proof: Let the singular value decomposition of the full-rank matrix A he A = UTiV'^ with 
U G M"^", S G M"^", and V G R'^^"; note that all the diagonal entries of S are strictly positive 
since A has full rank. We can now apply Theorem 4 of Section 6.1 of [20j to gellfl that 



\ln - V'^SS'^V\ 



\V^V -V^SS'^VW^ < e 



(32) 



for our choice of r with probability at least 1 — 6. Note that V^S G M"^*" (with r > n) and let 
CTj (V'^S) denote the singular values of V'^S for all i G [n]; the above inequality implies that for 
all i G In] 



1 



< 



V^SS^V\\^ < e < 0.5. 



Thus, all the singular values of V'^S are strictly positive and hence V'^S has full rank equal to 
n. Also, using e < 0.5, 



|l-.r2(yT^)|<_^<2e. 
We are now ready to prove our theorem: 

" opt-xopth = A'^{AS)^^{AS)H-A^b 



(33) 



X. 



In the above derivations we substituted the SVD of A, dropped terms that do not change unitarily 
invariant norms, and used the fact that V'^ S and S have full rank in order to simplify the 
pseudoinverse. Now let {V'^ {V'^ = In + E and note that Eqn. (I33p and the fact that 
V'^ S has full rank imply 



l^^ll 



In - {V^S)'^^ {V^S)'^ = max |l - a'"^ {V^ S) \ < 2e 



'We apply Theorem 4 of Section 6.1 of 20 with A = V'^ and note that \\V'^\\\ = n > 1, ||V' 



1, and 



V 



(i)- 
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Thus, we conclude our proof by observing that 



WXovt-XoptW^ = \\{In + E)T.'^U'h-T.-^U'b\\^ 
= ||^S-^C/^6||2 

< 11^112 ||S-iC/^6||2 

< 2e||xopt||2. 

In the above we used the fact that ||a;opt||2 = ||^''^^||2 ~ ll^^^^^'^^IL ~ ll^'^^'^^L' '^^^ 
running time of the algorithm follows by observing that AS is an n x r matrix and thus computing 
its pseudoinverse takes 0{'n?r) time; computing Xopt takes an additional 0{nr + dn) time. 

o 

Input: A £ R"^^, b £ M", error parameter e S (0, .5], failure probability 6, and a set of 
probabilities pi (for all i G [d]) summing up to one and satisfying Eqn. ()3ip . 

Output: Xopt G IK'^- 

1- Letr = |^ln(^). 

2. Let S G Mf^^^ be an all-zeros matrix. 

3. For t = 1,. . . ,r do 

• Pick if G [d] such that Pr (i^ = i) = pi. 

• = '^ly/rpTf 

4. Return i^pj = A^ {AS)'^^ {AS)'^ b. 

Algorithm 6: Approximately solving under-constrained least squares problems. 

We conclude the section with a few remarks. First, assuming that e, /3, and 6 are constants 
and nlnn = o{d), it immediately follows that Algorithm [6] runs in o{ii?d) time. It should be clear 
that we can use Theorem [T] and the related Algorithm [T] to approximate the statistical leverage 
scores, thus bypassing the need to exactly compute them. Second, instead of approximating 
the statistical leverage scores needed in Algorithm [H we could use the randomized Hadamard 
transform (essentially post-multiply A by a randomized Hadamard transform to make all statis- 
tical leverage scores uniform). The resulting algorithm could be theoretically analyzed following 
the lines of [20j. It would be interesting to evaluate experimentally the performance of the two 
approaches in real data. 

6.3 Extension to streaming environments 

In this section, we consider the estimation of the leverage scores and of related statistics when the 
input data set is so large that an appropriate way to view the data is as a data stream [32]. In this 
context, one is interested in computing statistics of the data stream while making one pass (or 
occasionally a few additional passes) over the data from external storage and using only a small 
amount of additional space. For sn n x d matrix A, with n ^ d, small additional space means 
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that the space complexity only depends logarithmically on the high dimension n and polynomially 
on the low dimension d. When we discuss bits of space, we assume that the entries of A can be 
discretized to O(logn) bit integers, though all of our results can be generalized to arbitrary word 
sizes. The general strategy behind our algorithms is as follows. 

• As the data streams by, compute TA, for an appropriate problem-dependent linear sketching 
matrix T, and also compute 1174, for a random projection matrix 110 

• After the first pass over the data, compute the matrix R~^, as described in Algorithm [H 
corresponding to HA (or compute the pseudoinverse of HA or the R matrix from any other 
QR decomposition of A). 

• Compute TAR^^Il2, for a random projection matrix 112, such as the one used by Algo- 
rithm [TJ 

With the procedure outlined above, the matrix T is effectively applied to the rows of AR~^Il2, 
i.e., to the sketch of A that has rows with Euclidean norms approximately equal to the row norms 
of U, and pairwise inner products approximately equal to those in U. Thus statistics related to 
U can be extracted. 

Large Leverage Scores. Given any n x d matrix A in a streaming setting, it is known how 
to find the indices of all rows A^^^ of A for which ||A(j)||2 ^ ''"II^IIfi foi" ^ parameter r, and in 
addition it is known how to compute a (1 -|- e)-approximation to Hi for these large rows. The 
basic idea is to use the notion of ^2-sampling on matrix A, namely, to sample random entries 
Aij with probability 74?j/||^|||n. A single entry can be sampled from this distribution in a single 
pass usmg 0{e'^log^{nd)) bits of space [m E]. More precisely, these references demonstrate 
that there is a distribution over O {de~'^ log^ {nd}) x n matrices T for which for any fixed matrix 
A G W""^, there is a procedure which given TA, outputs a sample {i,j) G [n] x [d] with probability 

(1 lb e) ii^'iji ± n~^^^h Technically, these references concern sampling from vectors rather than 

matrices, so T{A) is a linear operator which treats ^4 as a length-nd vector and applies the 
algorithm of |3H [S]. However, by simply increasing the number of rows in T by a factor of the 
small dimension d, we can assume T is left matrix multiplication. By considering the marginal 
along [n], the probability that i = a, for any a G [n], is 

By the coupon collector problem, running 0(t^^ log r^^) independent copies is enough to find 
a set containing all rows j4(j) for which ||j4(j)||2 > r||A|||;,, and no rows j4(j) for which ||j4(j)||2 < 
^ll^lll^ with probability at least 0.99. 

When applied to our setting, we can apply a random projection matrix 11 and a linear sketching 
matrix T which has 0{dT~^e~'^ log'^(n) log t~^) rows in the following manner. First, TA and HA 
are computed in the first pass over the data; then, at the end of the first pass, we compute R~^; 
and finally, we compute T AR~^Il2, for a random projection matrix 112. This procedure effectively 
applies the matrix T to the rows of AR~^Yi2, which have norms equal to the row norms oiU , up 
to a factor of 1 -|- e. The multiplication at the end by 112 serves only to speed up the time for 

^°In the offline setting, one would use an SRHT or another FJLT, while in the streaming setting one could use 
either of the following. If the stream is such that one sees each entire column of A at once, then one could do an 
FJLT on the column. Alternatively, if one see updates to the individual entries of A in an arbitrary order, then 
one could apply any sketching matrix, such as those of [Tj or of [16) . 
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processing TAR-^. Thus, by the results of [Ml IS], we can find ah the leverage scores that 
are of magnitude at least t||C/|||^ in small space and a single pass over the data. By increasing the 
space by a factor of 0(e~^logn), we can also use the £2-samples to estimate the norms ||C/(i)||2 
for the row indices i that we find. 

Entropy. Given a distribution />, a statistic of p of interest is the entropy of this distribution, 
where the entropy is defined as H{p) = p{i) log2(l//5(i)). This statistic can be approximated in 
a streaming setting. Indeed, it is known that estimating H{p) up to an additive e can be reduced 
to (1 + e)-approximation of the £p-norm of the vector . . . ,p{n)), for 0(log 1/e) different 

p G (0,1) [25j. Here e = e/(log^l/e • logn). When applied to our setting, the distribution of 
interest is p{i) = 2\\^ii)\\2- ^o compute the entropy of this distribution, there exist sketching 
matrices T for providing (1 + e)-approximations to the quantity Fp{F2) of an n x d matrix A, 
where Fp{F2) is defined as '}2^=i Il^(i)ll2^) using 0(e~^ log^ nlog 1/e) bits of space (see Theorem 1 
of [21] )• Thus, to compute the entropy of the leverage score distribution, we can do the following. 
First, maintain TA and HA in the first pass over the data, where T is a sketching matrix for 
-Pp(-^2); P S (0, 1). At the end of the first pass, compute and finally, compute T AR~^Ii2-, 

which effectively applies the Fp(F2)-estimation matrix T to the rows of the matrix AR~^Ii2- 
Therefore, by the results of \lb\ I21| . we can compute an estimate which is within an additive 
e of H{p) using 0((ie~'^log^nlog^^ 1/e) bits of space and a single pass. We note that it is also 
possible to estimate H{p) up to a multiplicative 1 + e factor using small, but more, space; see, 
e.g., m- 

Sampling Row Identities. Another natural problem is that of obtaining samples of rows of 
A proportional to their leverage score importance sampling probabilities. To do so, we use ^2- 
sampling [31, _5j as used above for finding the large leverage scores. First, compute TA and HA 
in the first pass over the data stream; then, compute i?~^; and finally, compute TAR~^. Thus, 
by applying the procedures of [5j a total of s times independently, we obtain s samples ii, . . . , is, 
with replacement, of rows of A proportional to ||f^(ii)||2) • • • i Hi; ^-^-j to their leverage score. 
The algorithm requires 0(sde~^ log^ n) bits of space and runs in a single pass. To obtain more 
than just the row identities ii, . . . e.g., to obtain the actual samples, one can read off these 
rows from ^ in a second pass over the matrix. 
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